“I never had a problem reaching a decision based on imperfect information. That’s just the way the world works.” ― Alex Ferguson, Leading: Learning from Life and My Years at Manchester United1
Legendary football manager Sir Alex Ferguson surely had no such problem while managing Manchester United from 1986 to 2013. Other clubs trying to win the English Premier League (EPL), however, may need to rely on other insights to achieve that goal.
Principal component analysis (PCA) can be one of the keys to finding those insights.
Abstract
The Season 2017-2018 statistics are downloaded from DataHub. The data contain the results of 380 EPL matches. There are 22 variables in total, 12 of which measure team play statistics. The fields of the dataset are described below.
| Field Name | Order | Type (Format) | Description |
|---|---|---|---|
| Date | 1 | date (%Y-%m-%d) | Match Date (dd/mm/yy) |
| HomeTeam | 2 | string (default) | Home Team |
| AwayTeam | 3 | string (default) | Away Team |
| FTHG | 4 | integer (default) | Full Time Home Team Goals |
| FTAG | 5 | integer (default) | Full Time Away Team Goals |
| FTR | 6 | string (default) | Full Time Result (H=Home Win, D=Draw, A=Away Win) |
| HTHG | 7 | integer (default) | Half Time Home Team Goals |
| HTAG | 8 | integer (default) | Half Time Away Team Goals |
| HTR | 9 | string (default) | Half Time Result (H=Home Win, D=Draw, A=Away Win) |
| Referee | 10 | string (default) | Match Referee |
| HS | 11 | integer (default) | Home Team Shots |
| AS | 12 | integer (default) | Away Team Shots |
| HST | 13 | integer (default) | Home Team Shots on Target |
| AST | 14 | integer (default) | Away Team Shots on Target |
| HF | 15 | integer (default) | Home Team Fouls Committed |
| AF | 16 | integer (default) | Away Team Fouls Committed |
| HC | 17 | integer (default) | Home Team Corners |
| AC | 18 | integer (default) | Away Team Corners |
| HY | 19 | integer (default) | Home Team Yellow Cards |
| AY | 20 | integer (default) | Away Team Yellow Cards |
| HR | 21 | integer (default) | Home Team Red Cards |
| AR | 22 | integer (default) | Away Team Red Cards |
The EPL has 20 clubs, and each club plays every other club twice in a season, once at its home stadium and once at its opponent's, for 38 games per club. The total number of records is therefore 20 x 19 = 380 matches; with 12 independent variables each, that makes 380 x 12 = 4,560 data points. The analysis can easily be extended to other seasons, but for simplicity this study uses only Season 2017-2018.
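The record count can be checked with a few lines of base R. This is only an illustrative sketch: the club labels are placeholders rather than the real team names, and only the counts matter.

```r
# Hypothetical club labels; not taken from the dataset.
clubs <- paste0("Club", 1:20)

# Every ordered (home, away) pair of two different clubs is one fixture.
fixtures <- expand.grid(HomeTeam = clubs, AwayTeam = clubs,
                        stringsAsFactors = FALSE)
fixtures <- fixtures[fixtures$HomeTeam != fixtures$AwayTeam, ]

n_matches <- nrow(fixtures)    # 20 * 19 matches in a season
n_values  <- n_matches * 12    # 12 game play variables per match

# Games played by a single club, home and away combined
games_per_club <- sum(fixtures$HomeTeam == "Club1" | fixtures$AwayTeam == "Club1")

c(matches = n_matches, values = n_values, per_club = games_per_club)
```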
Game play statistics are the independent variables that explain the game, i.e. the variance of the game, while the outputs are the game results. Only variables 11 to 22 (HS to AR) are used for principal component analysis. The analysis is based on HomeTeam data first; AwayTeam is handled in the same way.
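# Packages assumed by the code in this post
library(dplyr)      # %>%, group_by(), summarise(), mutate()
library(tibble)     # column_to_rownames()
library(ggplot2)    # plots
library(factoextra) # fviz_*() helpers
library(magrittr)   # %<>%
library(zoo)        # rollmeanr()

set.seed(123456)
d <- read.csv("season-1718_csv.csv", stringsAsFactors = FALSE)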
d1 <- d %>%
group_by(HomeTeam) %>%
summarise(
HS = sum(HS),
AS = sum(AS),
HST = sum(HST),
AST = sum(AST),
HF = sum(HF),
AF = sum(AF),
HC = sum(HC),
AC = sum(AC),
HY = sum(HY),
AY = sum(AY),
HR = sum(HR),
AR = sum(AR)
)
league.df <- column_to_rownames(d1, var = "HomeTeam")
summary(league.df)
## HS AS HST AST
## Min. :189.0 Min. :106.0 Min. : 55.00 Min. :37.0
## 1st Qu.:210.8 1st Qu.:182.0 1st Qu.: 65.75 1st Qu.:65.0
## Median :231.5 Median :217.0 Median : 73.00 Median :74.5
## Mean :258.1 Mean :206.2 Mean : 87.90 Mean :71.5
## 3rd Qu.:291.5 3rd Qu.:232.0 3rd Qu.: 99.75 3rd Qu.:81.0
## Max. :359.0 Max. :264.0 Max. :151.00 Max. :97.0
## HF AF HC AC
## Min. :165.0 Min. :152.0 Min. : 80.00 Min. : 47.00
## 1st Qu.:179.2 1st Qu.:191.2 1st Qu.: 93.75 1st Qu.: 77.25
## Median :191.5 Median :201.0 Median :103.00 Median : 88.00
## Mean :194.2 Mean :199.1 Mean :109.05 Mean : 86.35
## 3rd Qu.:209.0 3rd Qu.:213.5 3rd Qu.:126.25 3rd Qu.: 95.25
## Max. :234.0 Max. :237.0 Max. :148.00 Max. :128.00
## HY AY HR AR
## Min. :18.00 Min. :16.00 Min. :0.00 Min. :0.0
## 1st Qu.:23.75 1st Qu.:24.50 1st Qu.:0.00 1st Qu.:0.0
## Median :27.50 Median :31.00 Median :1.00 Median :1.0
## Mean :28.10 Mean :29.75 Mean :0.85 Mean :1.1
## 3rd Qu.:31.75 3rd Qu.:35.25 3rd Qu.:1.25 3rd Qu.:2.0
## Max. :41.00 Max. :40.00 Max. :3.00 Max. :4.0
ggplot(stack(league.df), aes(x = ind, y = values)) +
geom_boxplot(fill = "#3d195b", colour = "red") +
labs(
x = "Variables",
y = "Values",
title = "Boxplot for 12 variables"
)
Some facts can be concluded from the boxplot:
Some facts can be concluded from the correlation plot:
The strongest negative correlation, -0.8, is between AS and HS. If one team controls the game, its attack is severe enough to force the opponent onto the defensive.
Shots on target lead to more corner kicks, so AST is positively correlated with AC, at 0.81.
Home fouls are negatively correlated with home shots, which can be interpreted as better play going together with better sporting conduct.
Away fouls are not strongly correlated with the other factors. The red card factor is comparable between the home team and the away team, which can be interpreted as red cards being events unrelated to attack or defence statistics; they may be more of a referee-related issue.
Home-related and away-related factors are negatively correlated with each other, which is a reasonable representation.
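As a rough illustration of how such a correlation matrix is obtained, the sketch below uses base R's `cor()` on synthetic season totals. The data frame `toy` and the latent `strength` variable are invented for the example and do not reproduce the correlations quoted above.

```r
# Synthetic season totals for 20 hypothetical clubs (not the real data).
set.seed(1)
strength <- rnorm(20)   # made-up latent "dominance" of each club

toy <- data.frame(
  HS  = round(250 + 40 * strength + rnorm(20, sd = 10)),  # home shots rise with dominance
  AS  = round(200 - 35 * strength + rnorm(20, sd = 10)),  # opponents' shots fall
  HST = round(85 + 20 * strength + rnorm(20, sd = 5))     # home shots on target rise
)

# Pearson correlation matrix of the three variables
corr <- cor(toy)
round(corr, 2)

# Locate the most negative pair of variables
idx <- which(corr == min(corr), arr.ind = TRUE)[1, ]
most_negative <- c(rownames(corr)[idx["row"]], colnames(corr)[idx["col"]])
most_negative
```

In this toy setup the dominant side shoots more and the opponent shoots less, so HS and AS end up negatively correlated, which mirrors the pattern described for the real data.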
The idea behind principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of correlated variables while preserving as much of the information present in the data set as possible. To achieve this, a new set of variables, the principal components (PCs), is constructed by transforming the original variables. The PCs are uncorrelated and sorted from the highest variance explained to the lowest. For example, if there are 5 PCs, PC1 is the component that explains the most variance in the original variables' covariance matrix.
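These properties can be checked on a small synthetic example. The sketch below is not the analysis of the EPL data: the matrix `X` and its columns are invented, and base R's `prcomp()` is used on standardized variables, so the eigenvalues come from the correlation matrix.

```r
set.seed(42)
n <- 200
z <- rnorm(n)   # shared signal behind the first three columns

# Five made-up variables: three strongly correlated, two pure noise
X <- cbind(
  a = z + rnorm(n, sd = 0.3),
  b = z + rnorm(n, sd = 0.3),
  c = -z + rnorm(n, sd = 0.3),
  d = rnorm(n),
  e = rnorm(n)
)

fit <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of total variance carried by each PC, largest first
var_explained <- fit$sdev^2 / sum(fit$sdev^2)
round(var_explained, 3)

# The PC scores are mutually uncorrelated
score_cor <- cor(fit$x)
round(score_cor, 3)

# The PC variances are the eigenvalues of the correlation matrix
eig <- eigen(cor(X), symmetric = TRUE)$values
all.equal(eig, fit$sdev^2)
```

Because three of the five columns share one signal, the first PC absorbs most of their common variance, which is the dimensionality reduction described above.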
For PCA, prcomp is used because, according to literature2, prcomp uses singular value decomposition, which is generally the preferred method for numerical accuracy.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.427 1.1648 1.1065 1.03176 0.9992 0.74891 0.64172
## Proportion of Variance 0.491 0.1131 0.1020 0.08871 0.0832 0.04674 0.03432
## Cumulative Proportion 0.491 0.6041 0.7061 0.79483 0.8780 0.92476 0.95908
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.47133 0.34027 0.27169 0.24999 0.1296
## Proportion of Variance 0.01851 0.00965 0.00615 0.00521 0.0014
## Cumulative Proportion 0.97759 0.98724 0.99339 0.99860 1.0000
fviz_screeplot(pca,
addlabels = TRUE, ylim = c(0, 50),
main = "Screeplot of the first 10 PCs",
barfill = "#3d195b", barcolor = "#3d195b"
)
The first five PCs explain 87.80% of the total variance.
As the plot below shows, PCs beyond PC5 add little to the analysis.
# Total cos2 of variables on PC1 and PC2
fviz_cos2(pca,
choice = "var", axes = 1:2,
title = "Sum of independent variables in PC1 and PC2",
fill = "#3d195b", color = "#3d195b"
)
Since the eigenvalues beyond PC5 are well below 1, those components each explain less variance than a single standardized independent variable, so by the Kaiser rule they are dropped. Therefore, 5 PCs in total will be used.
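The eigenvalues behind this decision can be recomputed from the standard deviations printed in the "Importance of components" output, since each eigenvalue is the square of the corresponding standard deviation. The sketch below only re-uses those printed (rounded) numbers; the variable names are illustrative.

```r
# Standard deviations copied from the "Importance of components" output
sdev <- c(2.427, 1.1648, 1.1065, 1.03176, 0.9992, 0.74891,
          0.64172, 0.47133, 0.34027, 0.27169, 0.24999, 0.1296)

eigenvalues <- sdev^2                           # variance carried by each PC
prop        <- eigenvalues / sum(eigenvalues)   # proportion of variance
cum_prop    <- cumsum(prop)                     # cumulative proportion

round(data.frame(PC = 1:12, eigenvalues, prop, cum_prop), 4)

# Number of PCs with an eigenvalue strictly above 1
n_kaiser <- sum(eigenvalues > 1)
n_kaiser
```

With these rounded figures the eigenvalue of PC5 is about 0.998, so a strict cutoff at 1 would keep four components. PC5 sits practically on the threshold while PC6 drops to about 0.56, which is consistent with retaining five PCs here.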
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## HS 14.38937 1.51669 0.02446 0.53439 0.78434
## AS 14.13506 1.40576 0.00005 0.01467 5.86522
## HST 13.50251 2.83039 0.12417 0.37533 2.69839
## AST 13.75937 0.16770 0.56088 0.12155 0.95727
## HF 6.36282 15.78376 0.77463 0.30650 13.88877
## AF 0.46109 5.30076 53.92940 1.46386 14.90660
## HC 14.03100 1.98957 0.09251 0.65872 0.00659
## AC 12.42937 3.62713 2.10217 1.06416 1.07972
## HY 6.87327 0.42215 2.87897 6.26729 37.96259
## AY 3.33618 21.41235 32.88986 0.66215 0.19191
## HR 0.07654 0.63199 1.02227 88.50718 1.05281
## AR 0.64341 44.91173 5.60064 0.02419 20.60579
The red dashed line marks the 1/12 = 8.33% level, the contribution each of the 12 variables would make if all of them contributed equally.
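For a PCA on standardized variables, the percentage contribution of a variable to a PC equals 100 times its squared loading, so each column of contributions sums to 100. The sketch below shows this on random placeholder data (the matrix `X` and the names `V1` to `V12` are invented) and computes the equal-contribution reference level.

```r
set.seed(7)
# Placeholder data: 20 observations of 12 made-up variables
X <- matrix(rnorm(20 * 12), nrow = 20,
            dimnames = list(NULL, paste0("V", 1:12)))

fit <- prcomp(X, scale. = TRUE)

# Contribution (%) of each variable to each PC: squared loadings times 100
contrib <- 100 * fit$rotation^2
round(colSums(contrib), 6)   # every PC column sums to 100

# Reference level if all variables contributed equally
expected <- 100 / ncol(X)
round(expected, 2)

# Variables contributing more than the reference level to PC1
above_avg <- rownames(contrib)[contrib[, "PC1"] > expected]
above_avg
```

The same check applies to the table above: the Dim.1 column adds up to 100, and variables above 100/12 are the ones that stand out in each contribution plot.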
# Contributions of variables to PC1
fviz_contrib(pca, choice = "var", axes = 1, top = 12, fill = "#3d195b", color = "#3d195b")
PC1 is attack related: the high loadings are Home Team Shots, Away Team Shots, Home Team Shots on Target, Away Team Shots on Target, Home Team Corners and Away Team Corners, and it explains 49.10% of the variance. This is reasonable, as attacking is the best way to win a game in soccer and therefore goes a long way towards explaining the game.
# Contributions of variables to PC2
fviz_contrib(pca, choice = "var", axes = 2, top = 12, fill = "#3d195b", color = "#3d195b")
PC2 is a referee statistic; it seems to be a referee-related issue since both the home and away teams are involved. The major contributions are Home Team Fouls Committed, Away Team Yellow Cards and Away Team Red Cards. PC2 explains 11.31% of the variance.
# Contributions of variables to PC3
fviz_contrib(pca, choice = "var", axes = 3, top = 12, fill = "#3d195b", color = "#3d195b")
PC3 is the Away Team foul play statistic. The high loadings are Away Team Fouls Committed and Away Team Yellow Cards. PC3 explains 10.20% of the variance.
# Contributions of variables to PC4
fviz_contrib(pca, choice = "var", axes = 4, top = 12, fill = "#3d195b", color = "#3d195b")
PC4 is the Home Team Red Cards statistic. PC4 explains 8.87% of the variance.
# Contributions of variables to PC5
fviz_contrib(pca, choice = "var", axes = 5, top = 12, fill = "#3d195b", color = "#3d195b")
PC5 seems to represent home team advantage. The reason is that the loadings of Home Team Fouls Committed and Away Team Fouls Committed are similar, but Away Team Red Cards is much higher than Home Team Yellow Cards. PC5 explains 8.32% of the variance. The correlation circle in Section 4.1 provides a graphical representation of this PC.
The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: The observations are represented by their projections, but the variables are represented by their correlations (Abdi and Williams 2010).
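The relationship between loadings and these correlation coordinates can be verified numerically. In the sketch below, which uses invented data with made-up column names, each coordinate is the loading multiplied by the standard deviation of its PC, and this matches the directly computed correlation between the variable and the PC scores.

```r
set.seed(11)
# Placeholder data: 20 observations of 4 made-up variables
X <- matrix(rnorm(20 * 4), nrow = 20,
            dimnames = list(NULL, c("A", "B", "C", "D")))
X[, "B"] <- X[, "A"] + 0.5 * X[, "B"]   # make A and B correlated

fit <- prcomp(X, scale. = TRUE)

# Variable coordinates: loadings scaled by the PC standard deviations
coord <- sweep(fit$rotation, 2, fit$sdev, FUN = "*")

# Direct computation: correlation of each variable with each PC score
direct <- cor(X, fit$x)

max(abs(coord - direct))   # numerically zero

# cos2 is the squared coordinate; across all PCs it sums to 1 per variable
cos2 <- coord^2
rowSums(cos2)
```

Since every coordinate is a correlation, all arrows fall inside the unit circle, and a variable whose arrow reaches close to the circle is well represented by the two plotted PCs.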
fviz_pca_var(pca,
col.var = "cos2",
gradient.cols = c("black", "blue", "red"),
repel = TRUE # Avoid text overlapping
)
In the plot, positively correlated variables group together; the ‘Attack’ attributes form one group.
One interesting observation is that home team fouls are positively related to yellow cards but negatively related to red cards, while for away teams it is the reverse. This can easily be related to home advantage in field games: referees may be more inclined to give minor punishments to the home team, while away teams have a higher chance of getting red cards for foul play. This strongly supports the interpretation of PC5.
Below is a series of plots showing each club's contribution to the PCs.
fviz_contrib(pca,
  choice = "ind",
  axes = 1,
  fill = "#3d195b",
  color = "#3d195b",
  title = "Total Contribution of clubs to PC1"
)
fviz_contrib(pca,
  choice = "ind",
  axes = 2,
  fill = "#3d195b",
  color = "#3d195b",
  title = "Total Contribution of clubs to PC2"
)
fviz_contrib(pca,
  choice = "ind",
  axes = 3,
  fill = "#3d195b",
  color = "#3d195b",
  title = "Total Contribution of clubs to PC3"
)
fviz_contrib(pca,
  choice = "ind",
  axes = 4,
  fill = "#3d195b",
  color = "#3d195b",
  title = "Total Contribution of clubs to PC4"
)
fviz_contrib(pca,
  choice = "ind",
  axes = 5,
  fill = "#3d195b",
  color = "#3d195b",
  title = "Total Contribution of clubs to PC5"
)
A biplot combines the row data with the PCs. It visualizes the data by assigning PC1 and PC2 to the X and Y axes of a scatter chart, as below.
fviz_pca_biplot(pca,
repel = TRUE,
col.var = "blue", # Variables color
col.ind = "#3d195b" # Individuals color
)
Blue arrows starting from the origin are variables. Each club is shown as a dot, coming from the original rows.
The major findings from the analysis are as follows.
Manchester City and Chelsea are high in PC1, which translates to good home performance. Chelsea finished 5th and Manchester City were champions, so strong home performance appears to be a must for winning the league.
A low number of fouls can divide the club ranking. The left-side clubs, Liverpool, Tottenham, Manchester City, Chelsea and Arsenal, all finished in the top 6. Manchester United seems to be an outlier in the elite group, as it was the runner-up but is located in the middle of PC1. Clubs in the north-east region of the biplot, including West Bromwich Albion, Swansea City and Stoke City, are high in PC2, which means more fouls. Stoke City had a high number of Home Yellow Cards too. Besides, these three teams are weak in both home and away attack. In the final ranking these three teams performed worst and were relegated for Season 2018-2019.
(Loadings <- pca$rotation[, 1:5] %>%
round(2) %>%
data.frame() %>%
mutate(Attribute = rownames(.)) %>%
select(Attribute, everything()) %>%
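arrange(PC1))
# Attach the PC scores to the match data. This assumes pca$x has one row per
# row of d; a shorter score vector whose length divides nrow(d) would be
# silently recycled by `$<-`.
d$PC1 <- pca$x[, 1]
d$PC2 <- pca$x[, 2]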
d %>%
ggplot() +
geom_line(aes(x = as.Date(Date), y = PC1, group = 1), show.legend = F, colour = "#3d195b") +
geom_hline(yintercept = 0, colour = "red") +
facet_wrap(~HomeTeam) +
labs(
x = "Date",
y = "PC1",
title = "PC1 time series by clubs"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(date_labels = "%Y (%b)")
Manchester City kept its PC1 mean above 0 throughout the season.
d %>%
ggplot() +
geom_line(aes(x = as.Date(Date), y = PC2, group = 1), show.legend = F, colour = "#3d195b") +
geom_hline(yintercept = 0, colour = "red") +
facet_wrap(~HomeTeam) +
labs(
x = "Date",
y = "PC2",
title = "PC2 time series by clubs"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(date_labels = "%Y (%b)")
d %>%
group_by(HomeTeam) %>%
summarise(
Index.1 = mean(PC1),
Index.2 = mean(PC2)
) %>%
arrange(desc(Index.1)) %>%
select(HomeTeam, Index.1, Index.2)
d %<>% group_by(HomeTeam) %>%
arrange(Date) %>%
mutate(
M.A.PC1 = rollmeanr(PC1, 15, fill = NA),
M.A.PC2 = rollmeanr(PC2, 15, fill = NA)
)
Index.1718 <- d %>%
group_by(HomeTeam) %>%
filter(Date == max(Date)) %>%
select(HomeTeam, M.A.PC1, M.A.PC2, Date) %>%
arrange(M.A.PC1)
ggplot(Index.1718, aes(x = M.A.PC1, y = M.A.PC2)) +
geom_point(show.legend = F, colour = "#3d195b", size = 5) +
geom_text(aes(label = HomeTeam),
check_overlap = TRUE, nudge_y = 0.08,
show.legend = F, size = 2.9
) +
labs(x = "PC1", y = "PC2")
For a complete demonstration, the Away Team PCA is carried out below.
d2 <- d %>%
group_by(AwayTeam) %>%
summarise(
HS = sum(HS),
AS = sum(AS),
HST = sum(HST),
AST = sum(AST),
HF = sum(HF),
AF = sum(AF),
HC = sum(HC),
AC = sum(AC),
HY = sum(HY),
AY = sum(AY),
HR = sum(HR),
AR = sum(AR)
)
away.df <- column_to_rownames(d2, var = "AwayTeam")
summary(away.df)
## HS AS HST AST
## Min. :132.0 Min. :122.0 Min. : 47.00 Min. : 35.00
## 1st Qu.:230.8 1st Qu.:169.8 1st Qu.: 79.00 1st Qu.: 60.75
## Median :265.5 Median :199.0 Median : 90.00 Median : 68.50
## Mean :258.1 Mean :206.2 Mean : 87.90 Mean : 71.50
## 3rd Qu.:299.5 3rd Qu.:241.0 3rd Qu.: 96.75 3rd Qu.: 87.25
## Max. :335.0 Max. :317.0 Max. :123.00 Max. :110.00
## HF AF HC AC
## Min. :141.0 Min. :172.0 Min. : 56.0 Min. : 55.00
## 1st Qu.:186.8 1st Qu.:182.5 1st Qu.: 89.0 1st Qu.: 72.00
## Median :194.5 Median :201.0 Median :113.0 Median : 86.00
## Mean :194.2 Mean :199.1 Mean :109.0 Mean : 86.35
## 3rd Qu.:202.0 3rd Qu.:215.2 3rd Qu.:127.2 3rd Qu.:102.00
## Max. :232.0 Max. :226.0 Max. :153.0 Max. :146.00
## HY AY HR AR
## Min. :13.00 Min. :17.00 Min. :0.00 Min. :0.00
## 1st Qu.:24.75 1st Qu.:25.50 1st Qu.:0.00 1st Qu.:0.75
## Median :27.00 Median :31.00 Median :1.00 Median :1.00
## Mean :28.10 Mean :29.75 Mean :0.85 Mean :1.10
## 3rd Qu.:34.25 3rd Qu.:34.25 3rd Qu.:1.00 3rd Qu.:2.00
## Max. :40.00 Max. :38.00 Max. :3.00 Max. :3.00
ggplot(stack(away.df), aes(x = ind, y = values)) +
geom_boxplot(fill = "#3d195b", colour = "red") +
labs(
x = "Variables",
y = "Values",
title = "Boxplot for 12 variables"
)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.3100 1.4542 1.1921 0.96504 0.81775 0.76849
## Proportion of Variance 0.4447 0.1762 0.1184 0.07761 0.05573 0.04922
## Cumulative Proportion 0.4447 0.6209 0.7393 0.81692 0.87264 0.92186
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.59752 0.53131 0.38977 0.30503 0.22007 0.07058
## Proportion of Variance 0.02975 0.02352 0.01266 0.00775 0.00404 0.00042
## Cumulative Proportion 0.95161 0.97514 0.98780 0.99555 0.99958 1.00000
fviz_pca_var(pca2,
col.var = "cos2",
gradient.cols = c("black", "blue", "red"),
repel = TRUE # Avoid text overlapping
)
d$PC1a <- pca2$x[, 1]
d$PC2a <- pca2$x[, 2]
d %>%
ggplot() +
geom_line(aes(x = as.Date(Date), y = PC1a, group = 1), show.legend = F, colour = "#3d195b") +
geom_hline(yintercept = 0, colour = "red") +
facet_wrap(~AwayTeam) +
labs(
x = "Date",
y = "PC1a",
title = "PC1a time series by clubs (Away)"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(date_labels = "%Y (%b)")
d %>%
ggplot() +
geom_line(aes(x = as.Date(Date), y = PC2a, group = 1), show.legend = F, colour = "#3d195b") +
geom_hline(yintercept = 0, colour = "red") +
facet_wrap(~AwayTeam) +
labs(
x = "Date",
y = "PC2a",
title = "PC2a time series by clubs (Away)"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(date_labels = "%Y (%b)")
d %>%
group_by(AwayTeam) %>%
summarise(
Index.1 = mean(PC1a),
Index.2 = mean(PC2a)
) %>%
arrange(desc(Index.1)) %>%
select(AwayTeam, Index.1, Index.2)
Huddersfield Town3 is an outlier. Its final standing was not high, but it had good home performance. Part of the reason may be its turnaround matches in the league, in which it took points from big clubs like Chelsea and Manchester United.
Principal component analysis can sort out the winning elements and help club managers run their clubs.
Principal component analysis can represent information in a lower dimension, which makes the analysis easier to handle and helps uncover different aspects of the factors. Analysts who focus on the first few PCs can build a model with better performance.
Principal Component Analysis (PCA) with FactoMineR (decathlon dataset), François Husson & Magalie Houée-Bigot, http://factominer.free.fr/course/doc/RMarkdown_PCA_Decathlon.pdf
Principal Component Analysis, 2nd ed., I.T. Jolliffe, Springer.
1. Leading: Learning from Life and My Years at Manchester United by Alex Ferguson, Michael Moritz (With). ISBN 9780316268080
2. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
3. https://www.90min.com/posts/6063114-end-of-season-review-huddersfield-town-report-card-from-the-2017-18-campaign